-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add union
structure family
#668
base: main
Are you sure you want to change the base?
Conversation
I like the features of this PR. I wonder whether we can consolidate some naming consistency around the behaviors of:
On the other hand, this might already be as simple as it gets, and I just need a minute to get comfortable with the usage. :) |
This is perhaps a bit too reductive a statement, as really this contains structure and named sources and can iterate through the members. |
Even though the Then the above could be used something like this...maybe? client['x'].parts # <UnionParts {'table', 'C'}>
client['x'].data_sources() # low-level storage details, encoded in DataSource and Asset
client['x']['C'].data_sources() # one-element list, or would data_source() be better?
PUT /api/v1/array/full/x?part=C ...BODY # Refer to the union member, rather than its data source |
I like that suggestion very well, and I like the name In the future (months away) I hazily foresee enabling Tiled to track multiple versions of data:
This is why But, this also underlines why separate |
I also like that giving a distinct name to this concept helps clarify which abstraction level you are operating at. Referring to a |
Rebased on The to-dos...
I would at least like the validate this branch by connecting it to be a Bluesky document stream before merging. are, I believe, strictly additive and could be done in separate PRs or in this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to be very useful.
I've noted a few comments/questions. Additionally, should this 404 in PUT /awkward/full
be handled by passing StructureFamily.awkward
to SecureEntry
, as you've updated for the other routes?
Lines 1328 to 1338 in a407021
@router.put("/awkward/full/{path:path}") | |
async def put_awkward_full( | |
request: Request, | |
entry=SecureEntry(scopes=["write:data"]), | |
deserialization_registry=Depends(get_deserialization_registry), | |
): | |
body = await request.body() | |
if entry.structure_family != StructureFamily.awkward: | |
raise HTTPException( | |
status_code=404, detail="This route is not applicable to this node." | |
) |
Co-authored-by: Padraic Shafer <[email protected]>
Co-authored-by: Padraic Shafer <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a random thought on naming. What you're building sounds very close to a view
in SQL terminology...something that acts like a flat table but is backed by querying a subset of fields one or more joined tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And to expand on that thought...would it be useful to have a view
rather than a union
? If I know ahead of time that there will be cases during data analysis that will require 100 of the 1000 fields avaialble, maybe I could define the view with just those fields, and when pulling them out of tiled avoid marshalling the 900 unused fields.
Rethinking this in terms of a "view" is very interesting. Off the top of my head, I like that:
|
For example, maybe this is what a event stream could look like. Notice that:
|
Either path we take, union or view, would be speculative. We have to add this and really try it to understand how it flies. There is some risk either way. View, in addition to being a more widely recognized concept, could solve a broader category of problems for us. BlueskyRun is very nested, and this can mean a lot of clicks in the UI. I can imagine a view (or views) of the good parts, flat. Views could be combined with containers to create a nested-but-not-THAT-nested structure. (Dare I call it a "projection"?) I think we should keep the specification light and focused on current requirements, but I can see a lot of useful scope in this direction. Maybe I’ll start by opening a parallel PR for comparison that builds on this branch but refactors union into view. |
This is why we keep @dylanmcreynolds around. :-) |
Could If we took this |
if (catalog_adapter.structure_family == StructureFamily.union) and len( | ||
segments[i:] | ||
) == 1: | ||
# All the segments but the final segment, segments[-1], resolves |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# All the segments but the final segment, segments[-1], resolves | |
# All segments except the final segment, segments[-1], |
That sounds prudent.
Do you envision that views might evolve to include keys from other nodes (ancestors, siblings, 2nd-cousin-once-removed), or is that something that should be firmly disallowed? I can imagine complications arising from access policy as well as latency/timeouts from trying to include too many keys. |
I think there will be significant pressure to enable views that reach anywhere across the tree:
The specification of a union node involves listing data sources directly. If the specification of a view involves instead referencing other nodes, I think that access control is manageable, and the scaling can be managed if we enforce reasonable limits on the number of nodes allowed in one view. |
More generally, would merged views only work if all parts have the same length (number of rows)?...and if so would it enforce that by:
Or should views require a key to join upon, using merge behavior such as |
So rather than injecting that info into the run documents, you're suggesting to instead let tiled handle that when the data gets accessed? That makes a lot of sense. Run documents would be more "pure"--less coupled to how someone thought they should be viewed when they were recorded. When new views get dreamed up, they could be added to the tiled server config--restarting the server (or maybe registering the new view with the catalog) would allow that view to be applied to all data new and old. |
Maybe. The projections schema was added to the run But I only know of one case where projection/projectors were used since they were developed four years ago. Maybe that's a sign? Perhaps the issue is they weren't needed much, perhaps they weren't advertised well, or perhaps the mechanism was too complicated.
That's an interesting thought. I feel like that if it kept the scope in check, I'd be happy to say that a |
To fit our present use case, we would need a view to look a lot like union, mixing tables and arrays in one namespace. There would be [edit: NO] special guarantees about the relation between the items in the namespace, nothing about their length or how to join them. (It goes without saying that the constitutive tables would each internally have the normal length guarantee that it is made of whole rows.) I think the change from union is, the parts in a view would have their own canonical locations in the tree, as normal nodes. A view becomes an additional place to get (mixtures of…) nodes. Each views look a lot like a union would have, but their parts are pointers to first-class nodes, not to captive data sources. This enables us to separately build multiple views on the same data. And it avoids placing the canonical data in a weird structure that would require explanation (union). As far as "projections" goes, I like that this presents clients with the result rather than instructions (in a novel descriptive language…) for rearranging a structure on the client side. Yes, one can imagine constructing views dynamically through special adapters added to the server config—I think this is what @padraic-shafer’s last message envisions. For our present requirements I would start, though, by building them as static entities. The client specifies, via a HTTP request, "Add a view node to the database that combines array node X and columns A, B from table node Y into one namespace and present it to clients. |
I think this Union PR didn't yet add the capability to read the combined result at once, right? So what do we think should happen when the outer dimension of array X differs from the length of table Y? (or equivalently when I might be quibbling about edge cases. But I wonder about how 'edge'y these cases are. We could of course enforce this when the view node (meaning the outer node, not the projection node) is created, and then wait to see if it runs into issues during testing. |
For our present requirements we can keep this pretty limited, and spin out further discussion on whether and how to expand it once we have something to play with. |
Misc cleanup-prep changes from bluesky#668
* Sort enum members. * Move structure_family to the end, matching migration result. * Sort * Only set include_data_sources param if not default (false). * Refactor link-writing into separate module. * Refactor structure family check into dependency * remove server.core.FULL_LINKS * Clarify default data_sources for CatalogNodeAdapter * Add structure_families to dependency check for put '/[node|table]/full' * Create node uses structure_family of data_source * Be robust against __doc__ being None. * Use dask.dataframe.core, not new dask-expr. --------- Co-authored-by: Padraic Shafer <[email protected]>
* Sort enum members. * Move structure_family to the end, matching migration result. * Sort * Only set include_data_sources param if not default (false). * Refactor link-writing into separate module. * Refactor structure family check into dependency * remove server.core.FULL_LINKS * Clarify default data_sources for CatalogNodeAdapter * Add structure_families to dependency check for put '/[node|table]/full' * Create node uses structure_family of data_source * Be robust against __doc__ being None. * Use dask.dataframe.core, not new dask-expr. --------- Co-authored-by: Padraic Shafer <[email protected]>
This builds on commits from #661 and should be merged after it.[Update: #661 is in, and this has been rebased.]Problem statement
This PR is designed to solve the same problem that the pandas
BlockManager
12 solves: presenting data in a flat namespace to the user, but enabling groups of items in that namespace to be transparently backed by shared data structures, for better performance.For example, data coming from Bluesky includes data stored directly in the Event documents and large data written externally by detectors as arrays. The data in the Event documents is a great fit for tabular storage and transfer formats (e.g. Feather, Parquet, even CSV...). The externally-written data is not; it is better stored and transferred in large N-dimensional array formats like Zarr, HDF5, or a simple C-ordered buffer.
Users focused on science would like to smooth over these details. That is, we want to store and (often) move the data like this:
But offer a way to model it to the user in a flat namespace:
When writing (especially appending) the client will want to use the former view, so both views need to be available.
Solution
This PR adds a new structure family,
union
. The name is inspired by AwkwardArrayUnionForm
. It holds a heterogenous mixture of structures (e.g. tables and arrays). It enables the columns of the table and the arrays to be explored from a flat namespace. Name collisions are forbidden. But it also describes the underlying structures individually, enabling them to be read or written separately.To the user, this behaves much like a
Container
structure, would:I can, for example, access fields by key and download data:
Digging a little deeper, we can see a difference from Containers. The
union
shows thatA
andB
are backed by atable
(coincidentally named"table"
, could be anything) whileC
is standalonearray
.The
structure
of theunion
node reveals more detail; expand to view:Unlike
container
, theunion
structure always describes its full contents inline. It does not support paginating through its contents. It is not designed to scale beyond ~1000 fields.This script shows how the
union
was constructed. Code like this will rarely be user-facing; envision it wrapped in a utility that consumes Bluesky documents and writes and registers the relevant data into Tiled.The requests look like:
The query parameter
?data_source={name}
is used to address a specific component backing the node.Review of abstraction levels
DataSource
andAsset
.To Do
read()
onUnionClient
(i.e.c['x']
) itself, which could pull each data source in turn and return anxarray.Dataset
.GET /union/full/{path}
to enable bulk download. This will work similar to container export.Footnotes
https://uwekorn.com/2020/05/24/the-one-pandas-internal.html ↩
https://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/ ↩